Introduction to Data

Spring 2026

To start class

  1. Clone your week 2 repo to your machine.
  2. Open a file to keep notes in - either a qmd or script.
  3. Ensure the file is saved in that week's project folder.
  4. Repeat.

What the **** are data?

Data are a means to represent the world

Context Matters

Semantics

  • Row = observation, record, case, example, instance, pattern, sample
  • Columns = variable, field, feature, attribute, input, predictor, dimension

Categorical Data

  • Nominal: Unordered category
  • Ordinal: Ordered category
  • Both can be binary or multinomial
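
As a rough illustration of the nominal / ordinal distinction in R, the sketch below builds one factor of each kind. The vectors `type` and `size` are made-up examples, not course data.

```r
# Nominal: unordered categories (levels sort alphabetically by default)
type <- factor(c("fire", "grass", "water", "grass"))
levels(type)        # "fire" "grass" "water"
is.ordered(type)    # FALSE

# Ordinal: ordered categories; the level order is supplied explicitly
size <- factor(c("S", "L", "M", "S"), levels = c("S", "M", "L"), ordered = TRUE)
is.ordered(size)    # TRUE
size < "L"          # TRUE FALSE TRUE TRUE
```

Comparisons such as `<` only make sense for the ordered factor; R returns NA with a warning if they are tried on an unordered one.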

Numeric Data

  • Continuous: Can take on any number
    • Interval: Distances between values are equal and meaningful
      • The 0 point is ‘arbitrary’: it does not mean ‘none’
      • IQ, temperature (°C or °F), etc.
    • Ratio: Defined 0 point. Cannot fall below 0.
  • Discrete: Can only take on certain numbers. There are ‘gaps’ between numbers.
    • Counts & Integers (whole numbers)
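
One way to see why the interval / ratio distinction matters is to compare differences and ratios for temperature. This is a small sketch with made-up values; `celsius` and `kelvin` are illustrative names.

```r
celsius <- c(10, 20)
celsius[2] - celsius[1]     # 10: differences are meaningful on an interval scale
celsius[2] / celsius[1]     # 2, but 20 C is not "twice as hot" as 10 C

kelvin <- celsius + 273.15  # Kelvin has a true zero, so it is a ratio scale
kelvin[2] / kelvin[1]       # about 1.035
```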

A Note about Research Design

  • Qualitative Research: Descriptive statements to seek answers
  • Quantitative Research: Measurements to seek answers from qualitative or quantitative data
    • Data Science
  • Less precise: Qualitative / categorical
  • More precise: Quantitative / continuous

Data Types & R

Common Data Types

Type        Definition                              Example
Double      Whole or floating-point number          5 or 5.73
Integer     Whole number                            5L, 2L, 3L
Character   Single characters or strings of text    “c”, “cat”, “cat in the hat”
Factor      Categorical or discrete variables       M/F, S/M/L
Boolean     Binary categories (logical)             TRUE/FALSE
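
To check which type R actually assigns to a value, `typeof()` and `class()` can be used. A quick sketch (the object name `types` is just for illustration):

```r
# typeof() reports how R stores each value
types <- sapply(list(5, 5L, "cat", TRUE), typeof)
types                # "double" "integer" "character" "logical"

# A plain 5 is a double; the L suffix is what requests an integer
is.integer(5)        # FALSE
is.integer(5L)       # TRUE

# Factors have class "factor" but are stored as integer codes
class(factor("M"))   # "factor"
typeof(factor("M"))  # "integer"
```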

Data Type in R

Numbers

[1]  4.12  4.57  5.00 17.00

Characters

[1] "M"       "male"    "F"       "cat"     "Cat-Dog"

Factors

[1] M M F M
Levels: F M

Boolean

[1]  TRUE FALSE  TRUE FALSE

Special Data Types




NULL
[1] NA
[1] NaN
[1] Inf
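
The special values above each have a matching `is.*()` check. The following sketch uses a made-up vector `x` to show how they tend to behave in calculations:

```r
x <- c(1, NA, 3)
is.na(x)               # FALSE TRUE FALSE
mean(x)                # NA: missing values propagate
mean(x, na.rm = TRUE)  # 2

0 / 0                  # NaN ("not a number")
1 / 0                  # Inf
is.nan(0 / 0)          # TRUE
is.finite(1 / 0)       # FALSE

is.null(NULL)          # TRUE
length(NULL)           # 0: NULL is the absence of an object
length(NA)             # 1: NA is a placeholder for one missing value
```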

Data Modes

Each variable / object has a data mode: an umbrella category that groups related data types.

  • Numeric
    • Both integers and doubles
    • Includes factors (stored as integer codes)
  • Character
    • Characters and strings
  • Logical
    • Boolean TRUE and FALSE

The mode() function returns an object's data mode.

mode(42)
[1] "numeric"
a <- "beer"
mode(a)
[1] "character"
mode(T)
[1] "logical"
mode(as.factor("M"))
[1] "numeric"

Checking and Converting




is.numeric(2)
[1] TRUE
is.numeric(a)
[1] FALSE
is.character("a")
[1] TRUE
as.character(4)
[1] "4"
as.numeric(4)
[1] 4
  • It is often useful to make discrete categories (strings) into a factor. This allows for ease in analyses and visualizations.
fac <- c("M", "F")
mode(fac)
[1] "character"
fac <- factor(fac)
fac
[1] M F
Levels: F M
is.factor(fac)
[1] TRUE
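
Two conversion pitfalls are worth a quick sketch: converting a factor straight to numeric, and converting text that is not a number. The factor `f` below is a made-up example.

```r
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1: the internal level codes, not the labels
as.numeric(as.character(f))   # 10 20 10: convert the labels to text first

as.numeric("4")               # 4
# Text that is not a number becomes NA (R also emits a warning, silenced here)
bad <- suppressWarnings(as.numeric("four"))
bad                           # NA
```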

Data Structures

But First! Importing Data!

  • Option 1: Environment Pane > Import
    • Point and click
  • Option 2: Code!
data <- read.csv(...) # Base R

library(readr)
read_csv(...)

library(readxl)
read_xls(...)
dat <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-01/pokemon_df.csv')
dplyr::glimpse(dat)

Rows: 949
Columns: 22
$ id              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ pokemon         <chr> "bulbasaur", "ivysaur", "venusaur", "charmander", "cha…
$ species_id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ height          <dbl> 0.7, 1.0, 2.0, 0.6, 1.1, 1.7, 0.5, 1.0, 1.6, 0.3, 0.7,…
$ weight          <dbl> 6.9, 13.0, 100.0, 8.5, 19.0, 90.5, 9.0, 22.5, 85.5, 2.…
$ base_experience <dbl> 64, 142, 236, 62, 142, 240, 63, 142, 239, 39, 72, 178,…
$ type_1          <chr> "grass", "grass", "grass", "fire", "fire", "fire", "wa…
$ type_2          <chr> "poison", "poison", "poison", NA, NA, "flying", NA, NA…
$ hp              <dbl> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60, 40, 45…
$ attack          <dbl> 49, 62, 82, 52, 64, 84, 48, 63, 83, 30, 20, 45, 35, 25…
$ defense         <dbl> 49, 63, 83, 43, 58, 78, 65, 80, 100, 35, 55, 50, 30, 5…
$ special_attack  <dbl> 65, 80, 100, 60, 80, 109, 50, 65, 85, 20, 25, 90, 20, …
$ special_defense <dbl> 65, 80, 100, 50, 65, 85, 64, 80, 105, 20, 25, 80, 20, …
$ speed           <dbl> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 70, 50, 3…
$ color_1         <chr> "#78C850", "#78C850", "#78C850", "#F08030", "#F08030",…
$ color_2         <chr> "#A040A0", "#A040A0", "#A040A0", NA, NA, "#A890F0", NA…
$ color_f         <chr> "#81A763", "#81A763", "#81A763", NA, NA, "#DE835E", NA…
$ egg_group_1     <chr> "monster", "monster", "monster", "monster", "monster",…
$ egg_group_2     <chr> "plant", "plant", "plant", "dragon", "dragon", "dragon…
$ url_icon        <chr> "//archives.bulbagarden.net/media/upload/7/7b/001MS6.p…
$ generation_id   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ url_image       <chr> "https://raw.githubusercontent.com/HybridShivam/Pokemo…
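
For practice without an internet connection, the import step can be tried on a tiny file. This sketch writes a three-column CSV to a temporary file and reads it back with base R; the file contents and the name `toy` are invented for illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("pokemon,type_1,hp",
             "bulbasaur,grass,45",
             "charmander,fire,39"), path)

toy <- read.csv(path)  # base R; returns a data.frame
dim(toy)               # 2 3
toy$hp                 # 45 39
```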

Data Structures

Vector == 1 Dimension

  • All elements have the same mode.
pokemon      type_1
bulbasaur    grass
ivysaur      grass
venusaur     grass
charmander   fire
charmeleon   fire
charizard    fire
  • The c() function combines arguments into a vector.
v <- c(1, 2, 3)
w <- c(10, 11, 12)
c(v, w)
[1]  1  2  3 10 11 12
vec <- c("this", "is", "a", "vector")
vec
[1] "this"   "is"     "a"      "vector"
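
Because every element of a vector shares one mode, mixing types inside `c()` makes R convert them to a common one. A short sketch with made-up values:

```r
mixed <- c(1, "two", TRUE)
mixed                        # "1" "two" "TRUE"
mode(mixed)                  # "character"

# Without any text, logicals are converted to numbers instead
nums <- c(1, TRUE, FALSE)
nums                         # 1 1 0
mode(nums)                   # "numeric"
```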

Matrix == 2 dimensions, same mode

Vectors combined together.

pokemon      type_1   type_2
bulbasaur    grass    poison
ivysaur      grass    poison
venusaur     grass    poison
charmander   fire     NA
charmeleon   fire     NA
charizard    fire     flying
  • matrix() arranges the elements of a vector into rows and columns.

matrix(data=NA, nrow=1, ncol=1, byrow=FALSE, dimnames=NULL)

matrix(c(v, 4, 5, 6), nrow=2, ncol=3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
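
The `byrow` argument controls the fill direction, which is easy to mix up. The sketch below compares both directions and also shows `cbind()`, another base R way to build a matrix from vectors. All object names are illustrative.

```r
x <- 1:6
by_col <- matrix(x, nrow = 2, ncol = 3)                # default: fills down columns
by_row <- matrix(x, nrow = 2, ncol = 3, byrow = TRUE)  # fills across rows
by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

m <- cbind(a = 1:3, b = 4:6)  # cbind() binds vectors side by side as columns
dim(m)        # 3 2
```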

Data Frames == 2 dimensions, different modes

  • The tidyverse version is called a tibble
  • Most common R data structure.
  • Rectangular data [Rows X columns]
  • Each column has a single mode (AKA column = vector).
pokemon     type_1   hp
bulbasaur   grass    45
ivysaur     grass    60
venusaur    grass    80
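
The small table above could be built by hand from three vectors. This is a base R sketch; `df` is an illustrative name, and `stringsAsFactors = FALSE` is spelled out so the text columns stay character in older R versions too.

```r
df <- data.frame(
  pokemon = c("bulbasaur", "ivysaur", "venusaur"),
  type_1  = c("grass", "grass", "grass"),
  hp      = c(45, 60, 80),
  stringsAsFactors = FALSE
)
col_modes <- sapply(df, mode)  # one mode per column
col_modes                      # pokemon and type_1: "character"; hp: "numeric"
dim(df)                        # 3 3
```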

Arrays and Lists: 3+ Dimensions

  • Array: Many matrices stacked into 1 object / container.
    • 1 data mode
  • List: A container that can hold any combination of data structures
    • Multiple data modes & types
    • Complex but incredibly useful
  • Remember, there is a function for that!
array(...)
data.frame(...)
list(...)

# is.[...] / as.[...]
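
A brief sketch of the two structures, using made-up contents: an array holding four 2 x 3 matrices, and a list mixing three different modes.

```r
arr <- array(1:24, dim = c(2, 3, 4))  # four 2 x 3 matrices stacked
dim(arr)        # 2 3 4
arr[1, 2, 3]    # 15: row 1, column 2 of the third matrix

lst <- list(name = "bulbasaur",
            stats = c(hp = 45, attack = 49),
            legendary = FALSE)
lst_modes <- sapply(lst, mode)
lst_modes       # "character" "numeric" "logical"
lst$stats["hp"] # 45
is.list(lst)    # TRUE
```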

Note!

  • R is a computer programming language

    • Languages have dialects

      • Primary / Base: Pre-loaded
      • Others: Pull from library()
  • You have to walk before you can run… we will start with base

  • Remember R can only do what you ask it! It is hyper literal

Exploring Data

Take a Peek

  • head(x, n=6): return the first n elements
  • tail(x, n=6): return the last n elements
head(dat, n = 3)
id pokemon species_id height weight base_experience type_1 type_2 hp attack defense special_attack special_defense speed color_1 color_2 color_f egg_group_1 egg_group_2 url_icon generation_id url_image
1 bulbasaur 1 0.7 6.9 64 grass poison 45 49 49 65 65 45 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/7/7b/001MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/001.png
2 ivysaur 2 1.0 13.0 142 grass poison 60 62 63 80 80 60 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/a/a0/002MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/002.png
3 venusaur 3 2.0 100.0 236 grass poison 80 82 83 100 100 80 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/0/07/003MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/003.png
tail(dat, n = 2)
id pokemon species_id height weight base_experience type_1 type_2 hp attack defense special_attack special_defense speed color_1 color_2 color_f egg_group_1 egg_group_2 url_icon generation_id url_image
10146 kommo-o-totem 784 2.4 207.5 270 dragon fighting 75 110 125 100 105 85 #7038F8 #C03028 #8336C5 dragon NA NA NA https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/10146.png
10147 magearna-original 801 1.0 80.5 120 steel fairy 80 95 115 130 115 65 #B8B8D0 #EE99AC #C5B0C7 no-eggs NA NA NA https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/10147.png

Check the Structure

  • dim(x): return the dimensions of an object
    • Matrix, data frame, or array
    • rows, columns, (depth)
dim(dat)
[1] 949  22
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow=3, ncol=2)
dim(mat)
[1] 3 2
## WARNING: dim() returns NULL for a plain vector

a <- c(1, 2, 3, 4)
dim(a)
NULL
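
Since `dim()` gives NULL for a plain vector, `length()` is the usual alternative. A short sketch of the related size functions in base R:

```r
a <- c(1, 2, 3, 4)
is.null(dim(a))  # TRUE: a plain vector has no dim attribute
length(a)        # 4
NROW(a)          # 4: NROW() and NCOL() treat a vector as a one-column matrix
NCOL(a)          # 1

mat <- matrix(1:6, nrow = 3, ncol = 2)
nrow(mat)        # 3
ncol(mat)        # 2
length(mat)      # 6: total number of elements
```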

Check the Structure

  • str(x): universal check that works on any object
str(a)
 num [1:4] 1 2 3 4
str(dat)
spc_tbl_ [949 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id             : num [1:949] 1 2 3 4 5 6 7 8 9 10 ...
 $ pokemon        : chr [1:949] "bulbasaur" "ivysaur" "venusaur" "charmander" ...
 $ species_id     : num [1:949] 1 2 3 4 5 6 7 8 9 10 ...
 $ height         : num [1:949] 0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
 $ weight         : num [1:949] 6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
 $ base_experience: num [1:949] 64 142 236 62 142 240 63 142 239 39 ...
 $ type_1         : chr [1:949] "grass" "grass" "grass" "fire" ...
 $ type_2         : chr [1:949] "poison" "poison" "poison" NA ...
 $ hp             : num [1:949] 45 60 80 39 58 78 44 59 79 45 ...
 $ attack         : num [1:949] 49 62 82 52 64 84 48 63 83 30 ...
 $ defense        : num [1:949] 49 63 83 43 58 78 65 80 100 35 ...
 $ special_attack : num [1:949] 65 80 100 60 80 109 50 65 85 20 ...
 $ special_defense: num [1:949] 65 80 100 50 65 85 64 80 105 20 ...
 $ speed          : num [1:949] 45 60 80 65 80 100 43 58 78 45 ...
 $ color_1        : chr [1:949] "#78C850" "#78C850" "#78C850" "#F08030" ...
 $ color_2        : chr [1:949] "#A040A0" "#A040A0" "#A040A0" NA ...
 $ color_f        : chr [1:949] "#81A763" "#81A763" "#81A763" NA ...
 $ egg_group_1    : chr [1:949] "monster" "monster" "monster" "monster" ...
 $ egg_group_2    : chr [1:949] "plant" "plant" "plant" "dragon" ...
 $ url_icon       : chr [1:949] "//archives.bulbagarden.net/media/upload/7/7b/001MS6.png" "//archives.bulbagarden.net/media/upload/a/a0/002MS6.png" "//archives.bulbagarden.net/media/upload/0/07/003MS6.png" "//archives.bulbagarden.net/media/upload/7/7d/004MS6.png" ...
 $ generation_id  : num [1:949] 1 1 1 1 1 1 1 1 1 1 ...
 $ url_image      : chr [1:949] "https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/001.png" "https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/002.png" "https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/003.png" "https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/004.png" ...
 - attr(*, "spec")=
  .. cols(
  ..   id = col_double(),
  ..   pokemon = col_character(),
  ..   species_id = col_double(),
  ..   height = col_double(),
  ..   weight = col_double(),
  ..   base_experience = col_double(),
  ..   type_1 = col_character(),
  ..   type_2 = col_character(),
  ..   hp = col_double(),
  ..   attack = col_double(),
  ..   defense = col_double(),
  ..   special_attack = col_double(),
  ..   special_defense = col_double(),
  ..   speed = col_double(),
  ..   color_1 = col_character(),
  ..   color_2 = col_character(),
  ..   color_f = col_character(),
  ..   egg_group_1 = col_character(),
  ..   egg_group_2 = col_character(),
  ..   url_icon = col_character(),
  ..   generation_id = col_double(),
  ..   url_image = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Indexing Data

  • Index: an address to one or more values


# Vectors
b <- c(1, 3, 5, 7, 9)
b
[1] 1 3 5 7 9
b[3]
[1] 5
  • Matrices / DF:
    • 2 dimensions == 2 index values
    • ALWAYS ROWS x COLUMNS

id pokemon species_id height weight base_experience type_1 type_2 hp attack defense special_attack special_defense speed color_1 color_2 color_f egg_group_1 egg_group_2 url_icon generation_id url_image
1 bulbasaur 1 0.7 6.9 64 grass poison 45 49 49 65 65 45 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/7/7b/001MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/001.png
2 ivysaur 2 1.0 13.0 142 grass poison 60 62 63 80 80 60 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/a/a0/002MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/002.png
3 venusaur 3 2.0 100.0 236 grass poison 80 82 83 100 100 80 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/0/07/003MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/003.png
dat[2,1]
id
2
dat[2, 1:3]
id pokemon species_id
2 ivysaur 2
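
Beyond single positions, base R indexing also accepts several positions, negative positions, and logical tests. A sketch with a small vector and a made-up matrix `m`:

```r
b <- c(1, 3, 5, 7, 9)
b[c(1, 5)]   # 1 9: several positions at once
b[-1]        # 3 5 7 9: a negative index drops that position
b[b > 4]     # 5 7 9: a logical test keeps the TRUE positions

m <- matrix(1:6, nrow = 2, ncol = 3)
m[2, 3]      # 6: row 2, column 3
m[, 2]       # 3 4: an empty slot means "all rows" (or "all columns")
```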

Get a Summary

  • summary() returns the summary statistics for an object.
summary(dat[,c(2,4,5)])
   pokemon              height           weight      
 Length:949         Min.   : 0.100   Min.   :  0.10  
 Class :character   1st Qu.: 0.500   1st Qu.:  8.50  
 Mode  :character   Median : 1.000   Median : 28.80  
                    Mean   : 1.228   Mean   : 66.21  
                    3rd Qu.: 1.500   3rd Qu.: 66.60  
                    Max.   :14.500   Max.   :999.90  

Reformatting Data

Names

  • names(): Dataframe or list
  • colnames(): Dataframe or matrix
colnames(dat)
 [1] "id"              "pokemon"         "species_id"      "height"         
 [5] "weight"          "base_experience" "type_1"          "type_2"         
 [9] "hp"              "attack"          "defense"         "special_attack" 
[13] "special_defense" "speed"           "color_1"         "color_2"        
[17] "color_f"         "egg_group_1"     "egg_group_2"     "url_icon"       
[21] "generation_id"   "url_image"      

NB: These can be useful to get index values of columns.
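
For example, a column's position can be looked up by name instead of counted by hand. The sketch below uses a small invented data frame `df` rather than the full dataset.

```r
df <- data.frame(id = 1:2,
                 pokemon = c("bulbasaur", "ivysaur"),
                 hp = c(45, 60))
which(colnames(df) == "hp")          # 3
match(c("hp", "id"), colnames(df))   # 3 1

names(df)[2] <- "name"               # names can also be reassigned
colnames(df)                         # "id" "name" "hp"
```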

Selecting Columns - 1 Column

Remember, a column of data is a vector with 1 mode.

  • Some functions may require you to input a vector
  • To select only 1 column:
    • Matrix:
      • Index notation - e.g., dat[,5]
    • Dataframe:
      • $ - e.g., dat$hp
      • Index notation - e.g., dat[[9]] or dat[["hp"]]
        • Note - the goal here is to return a vector. With a tibble such as dat, dat[,5] returns a one-column tibble, not a vector; a base data.frame would return a vector.

head(dat$pokemon)
[1] "bulbasaur"  "ivysaur"    "venusaur"   "charmander" "charmeleon"
[6] "charizard" 
head(dat[["pokemon"]])
[1] "bulbasaur"  "ivysaur"    "venusaur"   "charmander" "charmeleon"
[6] "charizard" 
head(dat[[2]])
[1] "bulbasaur"  "ivysaur"    "venusaur"   "charmander" "charmeleon"
[6] "charizard" 
head(dat[2], n = 3)
pokemon
bulbasaur
ivysaur
venusaur
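
The same selectors can be checked on a base data.frame. This sketch uses an invented two-row `df`; note that a base data.frame drops to a vector with `df[, "hp"]`, whereas a tibble like `dat` keeps its shape.

```r
df <- data.frame(pokemon = c("bulbasaur", "ivysaur"), hp = c(45, 60))

is.vector(df$hp)                         # TRUE
is.vector(df[["hp"]])                    # TRUE
is.data.frame(df["hp"])                  # TRUE: single brackets keep the data frame
is.vector(df[, "hp"])                    # TRUE: base data.frames drop to a vector
is.data.frame(df[, "hp", drop = FALSE])  # TRUE: drop = FALSE keeps the data frame
```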

Selecting Columns - 2+ Columns

The result stays a dataframe (or matrix).

  • Always use index notation [ ]
head(dat[c(1,4,5)], n = 2)
id height weight
1 0.7 6.9
2 1.0 13.0
head(dat[,c(1,4,5)], n = 2)
id height weight
1 0.7 6.9
2 1.0 13.0

head(dat[,3:7])
species_id height weight base_experience type_1
1 0.7 6.9 64 grass
2 1.0 13.0 142 grass
3 2.0 100.0 236 grass
4 0.6 8.5 62 fire
5 1.1 19.0 142 fire
6 1.7 90.5 240 fire

summary(dat[,4:7])
     height           weight       base_experience    type_1         
 Min.   : 0.100   Min.   :  0.10   Min.   : 36.0   Length:949        
 1st Qu.: 0.500   1st Qu.:  8.50   1st Qu.: 68.0   Class :character  
 Median : 1.000   Median : 28.80   Median :157.0   Mode  :character  
 Mean   : 1.228   Mean   : 66.21   Mean   :150.5                     
 3rd Qu.: 1.500   3rd Qu.: 66.60   3rd Qu.:184.0                     
 Max.   :14.500   Max.   :999.90   Max.   :608.0                     
dat$type_1 <- as.factor(dat$type_1)

summary(dat[,4:7])
     height           weight       base_experience     type_1   
 Min.   : 0.100   Min.   :  0.10   Min.   : 36.0   water  :126  
 1st Qu.: 0.500   1st Qu.:  8.50   1st Qu.: 68.0   normal :111  
 Median : 1.000   Median : 28.80   Median :157.0   grass  : 84  
 Mean   : 1.228   Mean   : 66.21   Mean   :150.5   bug    : 79  
 3rd Qu.: 1.500   3rd Qu.: 66.60   3rd Qu.:184.0   rock   : 65  
 Max.   :14.500   Max.   :999.90   Max.   :608.0   psychic: 64  
                                                   (Other):420  
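
The change in the summary comes from the factor conversion: a factor is summarized by counting its levels. A small sketch with an invented vector `type_1` shows the same effect, plus `table()` as an alternative that counts without converting.

```r
type_1 <- c("water", "fire", "water", "grass", "water")
summary(type_1)              # character: only Length, Class, Mode

type_f <- as.factor(type_1)
levels(type_f)               # "fire" "grass" "water"
counts <- summary(type_f)    # factor: counts per level
counts                       # fire 1, grass 1, water 3

table(type_1)                # same counts, straight from the character vector
```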

Lab Time!